Response to Reviewer Comments

Neural Information Processing Systems

We thank all the reviewers for their time and valuable feedback.

The purpose of Thm 3.2 is to clarify how the kernel Bellman loss is related to the error. We tend to think of it as "the kernel whose kernel norm equals our kernel Bellman loss". Meanwhile, we believe that we can develop results similar to our Corollary 3.3 to explicitly clarify the concrete relation. We will discuss this extensively in the revision.

Regarding the comment that the properties of the empirical loss are not shown: we agree with the reviewer's comments on the issue of biasedness, but the analysis for the non-IID case is quite technical and could distract from this work's main focus. Therefore, we prefer to study it in a separate work that focuses on statistical guarantees and uncertainty estimation.

They play orthogonal roles, so it is not easy to say which is more important.

"Kernel-Based Reinforcement Learning" is not the same as the more general kernel methods used in our paper and other works. It is mentioned in the paper as a related work, and we will make the distinction explicit.

Thm 3.2 is meant to clarify how the kernel Bellman loss is related to the error. We will consider reformulating Theorem 3.2 into a "dual kernel" statement. Fig 2(d) is similar, but plots the (Bellman-error, K-loss) and (Bellman-error, L2-loss) pairs.


Fast Neural Kernel Embeddings for General Activations

Han, Insu, Zandieh, Amir, Lee, Jaehoon, Novak, Roman, Xiao, Lechao, Karbasi, Amin

arXiv.org Artificial Intelligence

The infinite-width limit has shed light on generalization and optimization aspects of deep learning by establishing connections between neural networks and kernel methods. Despite their importance, the utility of these kernel methods was limited in large-scale learning settings due to their (super-)quadratic runtime and memory complexities. Moreover, most prior works on neural kernels have focused on the ReLU activation, mainly due to its popularity but also due to the difficulty of computing such kernels for general activations. In this work, we overcome such difficulties by providing methods to work with general activations. First, we compile and expand the list of activation functions admitting exact dual activation expressions to compute neural kernels. When exact expressions are unknown, we present methods to approximate them effectively. We propose a fast sketching method that approximates any multi-layered Neural Network Gaussian Process (NNGP) kernel and Neural Tangent Kernel (NTK) matrices for a wide range of activation functions, going beyond the commonly analyzed ReLU activation. This is done by showing how to approximate the neural kernels using the truncated Hermite expansion of any desired activation function. While most prior works require data points on the unit sphere, our methods do not suffer from such limitations and are applicable to any dataset of points in $\mathbb{R}^d$. Furthermore, we provide a subspace embedding for NNGP and NTK matrices with near input-sparsity runtime and near-optimal target dimension which applies to any \emph{homogeneous} dual activation functions with rapidly convergent Taylor expansion. Empirically, with respect to exact convolutional NTK (CNTK) computation, our method achieves $106\times$ speedup for approximate CNTK of a 5-layer Myrtle network on CIFAR-10 dataset.
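The core approximation the abstract describes, replacing an activation's dual by a truncated Hermite expansion, can be sketched in a few lines. Below is a minimal illustration, not the paper's sketching algorithm: for unit-variance inputs with correlation rho, the dual activation k(rho) = E[sigma(u) sigma(v)] equals the series sum_i a_i^2 rho^i, where a_i are sigma's coefficients in the normalized probabilists' Hermite basis. All function names and the quadrature/truncation parameters here are illustrative choices.

```python
import numpy as np
from math import factorial, sqrt

def hermite_coeffs(act, trunc, quad_deg=120):
    """Coefficients a_i = E[act(Z) he_i(Z)], Z ~ N(0,1), in the
    normalized probabilists' Hermite basis, via Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite_e.hermegauss(quad_deg)  # weight e^{-x^2/2}
    w = w / np.sqrt(2 * np.pi)                           # -> standard normal
    fx = act(x)
    coeffs = []
    for i in range(trunc + 1):
        c = np.zeros(i + 1); c[i] = 1.0
        he_i = np.polynomial.hermite_e.hermeval(x, c) / sqrt(factorial(i))
        coeffs.append(np.sum(w * fx * he_i))
    return np.array(coeffs)

def dual_activation(rho, coeffs):
    """Truncated Hermite expansion of the dual: k(rho) = sum_i a_i^2 rho^i."""
    return float(np.sum(coeffs ** 2 * rho ** np.arange(len(coeffs))))

# Sanity check against the closed-form dual of ReLU
# (the degree-1 arc-cosine kernel).
relu = lambda x: np.maximum(x, 0.0)
a = hermite_coeffs(relu, trunc=30)
rho = 0.5
approx = dual_activation(rho, a)
exact = (np.sqrt(1 - rho**2) + (np.pi - np.arccos(rho)) * rho) / (2 * np.pi)
print(approx, exact)  # the two values should nearly coincide
```

Stacking such dual activations layer by layer is what yields multi-layer NNGP/NTK entries; the paper's contribution is doing this at scale via sketching rather than the naive evaluation above.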


Learning from Conditional Distributions via Dual Embeddings

Dai, Bo, He, Niao, Pan, Yunpeng, Boots, Byron, Song, Le

arXiv.org Machine Learning

Many machine learning tasks, such as learning with invariance and policy evaluation in reinforcement learning, can be characterized as problems of learning from conditional distributions. In such problems, each sample $x$ itself is associated with a conditional distribution $p(z|x)$ represented by samples $\{z_i\}_{i=1}^M$, and the goal is to learn a function $f$ that links these conditional distributions to target values $y$. These learning problems become very challenging when we only have limited samples, or in the extreme case only one sample, from each conditional distribution. Commonly used approaches either assume that $z$ is independent of $x$, or require an overwhelmingly large number of samples from each conditional distribution. To address these challenges, we propose a novel approach which employs a new min-max reformulation of the learning from conditional distributions problem. With this new reformulation, we only need to deal with the joint distribution $p(z,x)$. We also design an efficient learning algorithm, Embedding-SGD, and establish theoretical sample complexity for such problems. Finally, our numerical experiments on both synthetic and real-world datasets show that the proposed approach can significantly improve over the existing algorithms.
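The min-max idea can be illustrated on a toy instance. For squared loss, the conjugate identity $(v-y)^2 = \max_u [2u(v-y) - u^2]$ makes the objective linear in the conditional mean $\mathbb{E}_{z|x}[f(z)]$, so a single $(x, z)$ sample gives an unbiased stochastic gradient; naive SGD on $(f(z)-y)^2$ would instead pay an extra variance penalty and learn a biased (shrunken) $f$. The sketch below is a simplified stand-in for Embedding-SGD, not the paper's RKHS-embedded algorithm; the problem sizes, step sizes, and parameterization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: contexts x in {0,1,2}, z|x ~ N(mu[x], sigma^2),
# targets y(x) = 1 + 2*mu[x].  We fit f(z) = w0 + w1*z so that
# E[f(z)|x] matches y(x); the exact solution is w = (1, 2).
mu = np.array([-1.0, 0.0, 1.0])
sigma = 0.5
y = 1.0 + 2.0 * mu

w = np.zeros(2)      # primal parameters of f
u = np.zeros(3)      # dual variable, one value per context x
eta_w, eta_u = 0.01, 0.1
T = 200_000
w_avg, n_avg = np.zeros(2), 0

for t in range(T):
    x = rng.integers(3)                       # one joint sample (x, z)
    z = mu[x] + sigma * rng.standard_normal()
    phi = np.array([1.0, z])
    fz = w @ phi
    # Saddle-point updates on E[2u(x)(f(z)-y(x)) - u(x)^2]:
    u[x] += eta_u * (2.0 * (fz - y[x]) - 2.0 * u[x])  # dual ascent
    w -= eta_w * 2.0 * u[x] * phi                      # primal descent
    if t >= T // 2:                                    # Polyak averaging
        w_avg += w; n_avg += 1

w_avg /= n_avg
print(w_avg)  # close to (1, 2)
```

Note that naive single-sample SGD on $(f(z)-y)^2$ would converge to a slope of about $1.45$ here (shrunk by the within-context variance of $z$), while the saddle-point updates recover the unbiased slope of $2$.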